MTB Categorization Analysis
Introduction
For this project, our team will determine whether the specifications of mountain bikes (MTB) are enough to differentiate between the different types of mountain bike categories.
Currently, full suspension mountain bikes come in multiple categories:
- Cross Country (XC) | Tend to be the most lightweight, nimble, and designed to put the rider in an efficient pedaling position
- Enduro | Heavier frames, more travel and more downhill oriented geometry
- Trail | The most common category of bikes, considered to be the halfway point between XC and Enduro
- All Mountain | A more niche category which some manufacturers claim to be more downhill focused than trail bikes, but not designed for downhill races like Enduro bikes are
- Downcountry | A relatively new category between XC and Trail. Similar to the All Mountain category, these bikes aren’t race specific like XC bikes tend to be, but are lighter and faster than trail bikes.
With all of the factors to consider when designing a bike, there are no clear boundaries between these categories. For example, one brand’s Downcountry bike could be what another brand considers a Trail bike.
The goal of our project is to determine how many, if any, discrete categories should exist for mountain bikes. Since most specifications and geometric measurements have one direction when moving across the spectrum of bikes, it’s reasonable to believe that these measurements could be reduced to much fewer dimensions, and perhaps even one continuous principle component rather than discrete categories. Here is a diagram of some of the different types of geometric specifications on mountain bikes:
Various Dimension Features of a Bike’s Geometry
Let’s start by taking a look at the data.
# Read in sheet 2 of our data
mtb_data <- read_excel(here::here('Data/mtb_stats.xlsx'), 'Sheet1')
mtb_data <- mtb_data %>%
# Clean up the label column
mutate(label = str_replace_all(str_to_lower(label), '[:punct:]', ''))
# Pull out the class labels
labels <- mtb_data %>%
select(label)
# Let's view the mtb_data output
# In any kable outputs, display NAs as blanks
opts <- options(knitr.kable.NA = "")
mtb_data %>%
head(25) %>%
# Fix up the headers by replacing the underscores with spaces
rename_all(funs(str_replace_all(., "_", " "))) %>%
# Make everything proper capitalization
# rename_all(funs(str_to_title)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = F,
font_size = 12) %>%
# Make the header row bold and black so it's easier to read
row_spec(0, bold = T, color = "black") %>%
scroll_box(height = "400px", width = "100%")| model | brand | build type | price | url | image | setting | size used | label | rear travel | fork travel | f piston | f rotor dim | r piston | r rotor dim | head angle | seat angle | crank length | stem length | handlebar width | reach | stack | wheelbase | chainstay length | bb height | standover height |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| habit | cannondale | L | tr | 130 | 140 | 4 | 180 | 4 | 180 | 66.0 | 74.5 | 780 | 460.0 | 625.0 | 1210.0 | 435.0 | 339.0 | 770.0 | |||||||
| scalpel | cannondale | L | xc | 100 | 100 | 2 | 160 | 2 | 160 | 68.0 | 74.5 | 80 | 760 | 435.0 | 601.0 | 1175.0 | 436.0 | 331.0 | 745.0 | ||||||
| scalpel se | cannondale | L | dc | 120 | 120 | 2 | 160 | 2 | 160 | 67.0 | 74.0 | 780 | 450.0 | 611.0 | 1172.0 | 436.0 | 344.0 | 758.0 | |||||||
| reign advanced pro | giant | L | en | 146 | 170 | 4 | 203 | 4 | 203 | 64.6 | 76.4 | 40 | 800 | 488.0 | 631.0 | 1262.0 | 439.0 | 781.0 | |||||||
| trance advanced X pro | giant | high | L | tr | 135 | 150 | 4 | 203 | 4 | 180 | 66.2 | 77.9 | 50 | 800 | 494.0 | 624.0 | 1238.0 | 435.0 | 761.0 | ||||||
| trance advanced X pro | giant | low | L | tr | 135 | 150 | 4 | 203 | 4 | 180 | 65.5 | 77.2 | 50 | 800 | 486.0 | 631.0 | 1239.0 | 438.0 | 752.0 | ||||||
| anthem advanced pro | giant | L | xc | 90 | 100 | 2 | 180 | 2 | 160 | 69.0 | 73.5 | 80 | 780 | 454.0 | 594.0 | 1154.0 | 438.0 | 817.0 | |||||||
| jet 9 rdo | niner | high | M | tr | 120 | 130 | 4 | 180 | 4 | 180 | 66.5 | 76.0 | 40 | 800 | 450.0 | 613.0 | 1179.0 | 430.0 | 698.0 | ||||||
| jet 9 rdo | niner | low | M | tr | 120 | 130 | 4 | 180 | 4 | 180 | 66.0 | 75.5 | 40 | 800 | 444.0 | 617.0 | 1180.0 | 432.0 | 705.0 | ||||||
| rip 0 rdo | niner | high | M | tr | 140 | 150 | 4 | 180 | 4 | 180 | 66.0 | 75.8 | 800 | 440.0 | 615.0 | 1181.0 | 435.0 | 712.0 | |||||||
| rip 0 rdo | niner | low | M | tr | 140 | 150 | 4 | 180 | 4 | 180 | 65.0 | 75.2 | 800 | 433.0 | 619.0 | 1182.0 | 435.0 | 705.0 | |||||||
| rkt 9 rdo | niner | M | dc | 90 | 120 | 4 | 180 | 4 | 160 | 70.0 | 73.5 | 780 | 413.0 | 617.0 | 1111.0 | 439.0 | 739.0 | ||||||||
| rkt 9 rdo rs | niner | M | xc | 90 | 100 | 4 | 180 | 4 | 160 | 71.0 | 74.5 | 780 | 424.0 | 600.0 | 1103.0 | 439.0 | 728.0 | ||||||||
| megatower | santa cruz | L | en | 160 | 160 | 4 | 200 | 4 | 200 | 65.0 | 76.6 | 470.0 | 625.0 | 1231.0 | 435.0 | 343.0 | 713.0 | ||||||||
| tallboy | santa cruz | L | tr | 120 | 130 | 4 | 180 | 4 | 180 | 65.7 | 76.4 | 50 | 800 | 470.0 | 619.0 | 1211.0 | 430.0 | 335.0 | 706.0 | ||||||
| hightower | santa cruz | L | tr | 145 | 150 | 4 | 180 | 4 | 180 | 65.5 | 76.8 | 50 | 780 | 473.0 | 619.0 | 1231.0 | 433.0 | 344.0 | 717.0 | ||||||
| blur | santa cruz | L | xc | 100 | 100 | 2 | 160 | 2 | 160 | 69.0 | 74.0 | 750 | 460.0 | 598.0 | 1160.0 | 432.0 | 328.0 | 723.0 | |||||||
| blur tr | santa cruz | L | dc | 115 | 120 | 2 | 180 | 2 | 180 | 67.1 | 74.9 | 175 | 60 | 760 | 457.5 | 606.5 | 1183.2 | 435.8 | 339.6 | 745.4 | |||||
| ransom | scott | L/29 | en | 170 | 170 | 4 | 203 | 4 | 180 | 64.5 | 75.0 | 50 | 800 | 466.5 | 627.6 | 1249.2 | 437.9 | 353.0 | 760.9 | ||||||
| spark | scott | L | tr | 120 | 130 | 4 | 180 | 4 | 180 | 67.2 | 73.8 | 70 | 760 | 460.0 | 602.4 | 1182.8 | 438.0 | 327.0 | 778.0 | ||||||
| genius | scott | high | L | tr | 150 | 150 | 4 | 203 | 4 | 180 | 65.6 | 75.3 | 50 | 780 | 472.0 | 609.2 | 1230.8 | 436.0 | 340.0 | 749.5 | |||||
| genius | scott | low | L | tr | 150 | 150 | 4 | 203 | 4 | 180 | 65.0 | 74.8 | 50 | 780 | 466.1 | 613.7 | 1232.1 | 438.0 | 345.9 | 758.4 | |||||
| spark rc | scott | L | xc | 100 | 110 | 2 | 180 | 2 | 160 | 68.5 | 73.8 | 80 | 740 | 456.8 | 596.2 | 1158.6 | 435.0 | 319.5 | 756.0 | ||||||
| epic evo | specialized | high | m | dc | 110 | 120 | 4 | 180 | 4 | 160 | 67.0 | 74.5 | 175 | 60 | 760 | 436.0 | 597.0 | 1164.0 | 438.0 | 339.0 | 781.0 | ||||
| epic evo | specialized | low | m | dc | 110 | 120 | 4 | 180 | 4 | 160 | 66.5 | 74.5 | 175 | 60 | 760 | 436.0 | 597.0 | 1164.0 | 438.0 | 336.0 | 781.0 |
EDA
DataExplorer::plot_bar(mtb_data,
ggtheme = theme_classic(),
title = 'Distribution of Categorical Variables',
theme_config = theme(plot.title = element_text(hjust = 0,
color = "slateblue4",
size = 24),
plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14)),
maxcat = 10,
ncol = 2)DataExplorer::plot_density(mtb_data,
ggtheme = theme_classic(),
title = 'Distribution of Continuous Variables',
geom_density_args = list(fill = 'slateblue'),
theme_config = theme(plot.title = element_text(hjust = 0,
color = "slateblue4",
size = 24),
plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14)),
ncol = 3)Average bikes by flip-chip setting
# Split data based on setting vs. no setting
no_setting <- mtb_data %>%
filter(is.na(setting))
setting <- mtb_data %>%
filter(!is.na(setting))
setting <- cbind(setting$model, setting$label, select_if(setting, is.numeric))
setting$model <- setting$`setting$model`
setting <- setting %>% select(-`setting$model`)
setting$label <- setting$`setting$label`
setting <- setting %>% select(-`setting$label`)
mean_by_setting <- aggregate(x=select(setting, -c(model, label)),
by=list(setting$model, setting$label),
FUN=mean)
mean_by_setting$model <- mean_by_setting$Group.1
mean_by_setting$label <- mean_by_setting$Group.2
mean_by_setting <- mean_by_setting %>% select(-c(Group.1, Group.2))
no_setting <- cbind(no_setting$model, no_setting$label, select_if(no_setting, is.numeric))
no_setting$model <- no_setting$`no_setting$model`
no_setting <- no_setting %>% select(-`no_setting$model`)
no_setting$label <- no_setting$`no_setting$label`
no_setting <- no_setting %>% select(-`no_setting$label`)
new_mtb_data <- data.frame(rbind(mean_by_setting, no_setting))
rownames(new_mtb_data) <- new_mtb_data$modelVariation Amongst Featureset
The first thing we’ll do is look to see if any of the features in our dataset are better at explaining the variation amongst the different bikes than other features. That is, it’s completely possible that two features are similar and don’t have much variation in them, even across some of the different bike categories. To do so, we’ll:
- Look for highly correlated features and flag these for potential
removal;
- Run Principal Component Analysis (PCA) to see if certain features are better at explaining the variation in our data better than others.
1. Correlation
First, let’s take a look at our most highly correlated features.
We’ll use the corrplot() function to better order the
highly correlated features by the angular order of their
eigenvectors.
mtb_correlation <- mtb_data %>%
# Get rid of price for now
select(-price) %>%
# Select our variables of interest
select_if(is.numeric) %>%
# Remove rows with NAs in them
# drop_na() %>%
# Build our correlation matrix, such that missing values are handled by casewise deletion
cor(use = 'complete.obs')
# Convert our results into a tibble for easier manipulation
mtb_correlation_df <- mtb_correlation %>%
as_tibble() %>%
mutate(variable = colnames(mtb_correlation)) %>%
relocate(variable, everything())
# Build our correlation plot, using the angular order of the eigenvectors
corrplot(mtb_correlation,
diag = F,
col = COL2('PRGn'),
tl.col = 'slateblue4',
type = 'lower',
method = 'color',
order = 'AOE',
title = 'Mountain Bike Feature Correlation'
)Here we see some obvious correlations, for example:
- f_piston (front brakes) is perfectly correlated with
r_piston (rear brakes), which makes sense since mountain
bikes tend to use the same types/spec of brakes for the front vs. rear
tires.
- fork_travel has a correlation above .95 with:
fork_travel. This make sense; for example, rear_travel
should be highly correlated with fork_travel.
In all, here are the most highly correlated variables (i.e. variables which have a correlation above .95 or below -.95):
mtb_correlation_df %>%
pivot_longer(-variable,
names_to = 'correlated_variable',
values_to = 'correlation') %>%
filter(variable != correlated_variable) %>%
filter((correlation > .95) | (correlation < -.95)) %>%
pander()| variable | correlated_variable | correlation |
|---|---|---|
| f_piston | r_piston | 1 |
| r_piston | f_piston | 1 |
| seat_angle | stem_length | -0.9506 |
| stem_length | seat_angle | -0.9506 |
| stack | wheelbase | 0.9512 |
| wheelbase | stack | 0.9512 |
For now, we’ll opt to include everything. But later on, as we analyze the importance of different features, we’ll look to remove some of the above variables first.
2. PCA
Next, we’ll apply PCA to our dataset. In so doing, we’ll have to center and scale our data given how different the ranges are for certain measurements. Let’s take a look at our 5 principal components which explain the largest proportion of variance in the data:
# Impute missing values with column mean (not really best practice, but good enough)
for (c in 1:ncol(new_mtb_data)){
if (is.numeric(unlist(new_mtb_data[,c]))){
# print(colnames(new_mtb_data)[c])
new_mtb_data[is.na(new_mtb_data[,c]), c] <- mean(unlist(new_mtb_data[,c]), na.rm=TRUE)
}
}
# TODO get average bikes by setting to work above
mtb_no_null <- new_mtb_data %>%
select(-price) %>%
select_if(is.numeric) %>%
bind_cols(label = new_mtb_data$label) %>%
drop_na()
# head(mtb_no_null)
mtb_pca <- prcomp(mtb_no_null %>% select(-label),
center = TRUE,
scale. = TRUE)
# Put our summary results into a dataframe
mtb_pca_df <- tibble(variable = c('Standard Deviation', 'Proportion of Variance', 'Cumulative Proportion')) %>%
cbind(summary(mtb_pca)$importance)
# head(mtb_pca_df)
mtb_pca_df %>%
# Only display the first 6 columns
select(c(variable:PC5)) %>%
pander()| Â | variable | PC1 | PC2 |
|---|---|---|---|
| Standard deviation | Standard Deviation | 2.972 | 1.301 |
| Proportion of Variance | Proportion of Variance | 0.5194 | 0.09957 |
| Cumulative Proportion | Cumulative Proportion | 0.5194 | 0.619 |
| Â | PC3 | PC4 | PC5 |
|---|---|---|---|
| Standard deviation | 1.211 | 1.089 | 0.8863 |
| Proportion of Variance | 0.0862 | 0.06972 | 0.0462 |
| Cumulative Proportion | 0.7052 | 0.7749 | 0.8211 |
We can see that, actually, starting at our \(5^{\text{rd}}\) principal component, nearly 82.1% of the data’s variation is properly explained. Let’s take a look at our top 2 principal components:
p_load(devtools,
ggbiplot)
ggbiplot(mtb_pca,
obs.scale = 1,
var.scale = 1,
groups = mtb_no_null$label,
ellipse = TRUE,
circle = FALSE,
ellipse.prob = .5) +
theme(legend.direction = 'horizontal',
legend.position = 'top')# jpeg('../Images/pca.jpg')Here we can see that our top 2 principal components, which explain roughly 61.9% of the variation in our data, are already pretty good representations for describing the different components in our dataset. Even so, the groupings are distinctly plotted on the 2-D graph.
Rowdiness scale
Clustering
K-Means
# How many clusters are necessary? 4
mtb_numeric <- mtb_no_null %>%
select(-label)
mtb_standard_scaled <- scale(mtb_numeric)
mtb_numeric <- mtb_no_null %>%
select(-label)
mtb_numeric <- mtb_no_null %>%
select(-label)
clusters <- 1:10
dists <- c()
for (c in 1:10){
km <- kmeans(mtb_standard_scaled, centers=c, iter.max=1000)
dists <- c(dists, km$tot.withinss)
}
# jpeg('../Images/Kmeans.jpg')
# plot(clusters, dists, type='l', xlab='Clusters', ylab='Total Sum of Squared Euclidean Distances')
# Plot our results
tibble(clusters = clusters,
dists = dists) %>%
ggplot(aes(x = clusters, y = dists)) +
geom_point(size = 3, alpha = .9, color = 'slateblue') +
geom_line(size = 2, alpha = 1, color = 'slateblue1') +
labs(title = "K-Means Clustering of MTB Data",
subtitle = 'Method uses `tot.withinss` parameter to measure distances.',
x = 'Clusters',
y = 'Total Sum of Squared Euclidean Distances')# Let's see where these clusters would end up on the 2D PCA plot
mtb_pca_scaled <- prcomp(mtb_standard_scaled,
center = F,
scale. = F)
pca_2_scaled <- as.matrix(mtb_standard_scaled) %*% as.matrix(mtb_pca_scaled$rotation[,1:2])
pca_km_scaled <- kmeans(pca_2_scaled, centers=3, iter.max=1000)
#Something's not working here, definitely hitting a local min or something
colorgroups <- function(g){
if (g == 'tr' || g == 'Trail'){
return('blue')
}
else if (g == 'xc' || g == 'Cross Country'){
return('pink')
}
else if (g == 'dc' || g == 'Downcountry'){
return('darkgoldenrod3')
}
else if (g == 'am' || g == 'All Mountain'){
return('red')
}
else if (g == 'en' || g == 'Enduro'){
return('green')
}
}
catNames <- c('Cross Country', 'Downcountry', 'Trail', 'All Mountain', 'Enduro')
cols <- unlist(lapply(new_mtb_data$label, colorgroups))
unlist(lapply(catNames, colorgroups))## [1] "pink" "darkgoldenrod3" "blue" "red"
## [5] "green"
# jpeg('../Images/PCA_clusters.jpg')
plot(pca_2_scaled, col=cols)
points(pca_km_scaled$centers, col = 'slateblue4', pch = 'x', cex = 1.5)
# text(pca_2_scaled[,1], pca_2_scaled[,2], rownames(pca_2_scaled))
legend("bottomleft", legend= c(catNames, 'Cluster Center'), col=c(unlist(lapply(catNames, colorgroups)), 'black'), pch=c(rep('o', 5), 'X'))#TODO let's look at this bottom cluster - both Niner bikes
#Niner has low reach numbers on its bikes - could be because we used the Medium for these!
#Based on PCA mapping, the blur tr, expic, Exie, Ripley, and Element all have less chainstay length, and less pistons?? wow, should we exclude piston count?? with more 2 piston bikes getting added, it evens out the average, so these aren't showing up as much anymoreGaussian Mixture Model (GMM)
In this section, we’ll take a more probabilistic model to our clustering. That is, we’ll use a Guassian Mixture Model (GMM) to build out normally distributed subgroupings within our mountain bike dataset, where the densities of each of the subgroupings represents a probability that a bike belongs to that subgrouping. Unlike K-Means, which is a more centroid-based clustering method, GMM is more of a distribution-based clustering method.
p_load(ClusterR)
# Build our GMM model
mtb_gmm <- GMM(mtb_standard_scaled,
dist_mode = 'eucl_dist', # Distance metric to use during seeding of initial means clustering
seed_mode = 'random_subset', # How initial means are seeded prior to EM alg
km_iter = 10, # Num of iterations of K-Means alg
em_iter = 10, # Num of iterations of EM alg
verbose = T
)## gmm_diag::learn(): generating initial means
## gmm_diag::learn(): k-means: n_threads: 1
## gmm_diag::learn(): k-means: iteration: 1 delta: 5.7245
## gmm_diag::learn(): k-means: iteration: 2 delta: 6.42421e-34
## gmm_diag::learn(): generating initial covariances
## gmm_diag::learn(): EM: n_threads: 1
## gmm_diag::learn(): EM: iteration: 1 avg_log_p: -23.9741
## gmm_diag::learn(): EM: iteration: 2 avg_log_p: -23.9741
##
## time to complete : 7.5041e-05
mtb_gmm_pred <- predict(mtb_gmm, mtb_standard_scaled)
opt_gmm <- Optimal_Clusters_GMM(mtb_standard_scaled,
max_clusters = 20,
criterion = "BIC",
dist_mode = "eucl_dist",
seed_mode = "random_subset",
km_iter = 10,
em_iter = 10,
var_floor = 1e-10,
plot_data = T)Use the mclust package in R, which utilizes Bayesian
Information Criterion (BIC) to optimize the number of clusters.
p_load(mclust)
mtb_gmm2 <- Mclust(mtb_standard_scaled)
#or specify number of clusters
# mb3 = Mclust(iris[,-5], 3)
# optimal selected model
# mtb_gmm2$modelName
# optimal number of cluster
# mtb_gmm2$G
# probality for an observation to be in a given cluster
# head(mtb_gmm2)
# get probabilities, means, variances
summary(mtb_gmm2, parameters = TRUE)## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust XXX (ellipsoidal multivariate normal) model with 1 component:
##
## log-likelihood n df BIC ICL
## 1095.022 58 170 1499.768 1499.768
##
## Clustering table:
## 1
## 58
##
## Mixing probabilities:
## 1
## 1
##
## Means:
## [,1]
## rear_travel -0.0000000000000010855331
## fork_travel 0.0000000000000005545634
## f_piston -0.0000000000000010381606
## f_rotor_dim -0.0000000000000007034374
## r_piston -0.0000000000000010381606
## r_rotor_dim -0.0000000000000041790108
## head_angle 0.0000000000000013484653
## seat_angle -0.0000000000000023154629
## crank_length -0.0000000000000018577082
## stem_length -0.0000000000000001790994
## handlebar_width 0.0000000000000039486004
## reach 0.0000000000000013751813
## stack 0.0000000000000068490980
## wheelbase 0.0000000000000078192571
## chainstay_length -0.0000000000000365012607
## bb_height -0.0000000000000083943779
## standover_height 0.0000000000000022679740
##
## Variances:
## [,,1]
## rear_travel fork_travel f_piston f_rotor_dim r_piston
## rear_travel 0.98275862 0.86647130 0.53074705 0.72270044 0.53074705
## fork_travel 0.86647130 0.98275862 0.59701027 0.78438697 0.59701027
## f_piston 0.53074705 0.59701027 0.98275862 0.57760629 0.98275862
## f_rotor_dim 0.72270044 0.78438697 0.57760629 0.98275862 0.57760629
## r_piston 0.53074705 0.59701027 0.98275862 0.57760629 0.98275862
## r_rotor_dim 0.77073412 0.79437848 0.43519186 0.74245994 0.43519186
## head_angle -0.81199232 -0.83055385 -0.44808970 -0.66046683 -0.44808970
## seat_angle 0.59460756 0.63732989 0.36523677 0.56513695 0.36523677
## crank_length -0.05247594 -0.05329849 -0.01705486 -0.04492137 -0.01705486
## stem_length -0.52751652 -0.54137826 -0.45331950 -0.37291853 -0.45331950
## handlebar_width 0.61297175 0.66689437 0.53800825 0.56677331 0.53800825
## reach 0.50878091 0.46844618 0.12052159 0.41797347 0.12052159
## stack 0.62986365 0.70353377 0.40627172 0.46370383 0.40627172
## wheelbase 0.81153584 0.80687536 0.36946161 0.65071880 0.36946161
## chainstay_length 0.21765491 0.24627225 0.23496345 0.33487904 0.23496345
## bb_height 0.60876155 0.66944392 0.45592502 0.52235140 0.45592502
## standover_height -0.17884464 -0.08927132 -0.28317481 -0.04107194 -0.28317481
## r_rotor_dim head_angle seat_angle crank_length
## rear_travel 0.77073412 -0.8119923156 0.59460756 -0.0524759431
## fork_travel 0.79437848 -0.8305538489 0.63732989 -0.0532984850
## f_piston 0.43519186 -0.4480897037 0.36523677 -0.0170548600
## f_rotor_dim 0.74245994 -0.6604668277 0.56513695 -0.0449213682
## r_piston 0.43519186 -0.4480897037 0.36523677 -0.0170548600
## r_rotor_dim 0.98275862 -0.7644254772 0.63374974 -0.0175895936
## head_angle -0.76442548 0.9827586207 -0.74721358 -0.0007695164
## seat_angle 0.63374974 -0.7472135761 0.98275862 -0.0512057171
## crank_length -0.01758959 -0.0007695164 -0.05120572 0.9827586207
## stem_length -0.39794105 0.5763723220 -0.52789789 -0.0783916344
## handlebar_width 0.56297068 -0.6713014498 0.52755131 0.0926088407
## reach 0.45588325 -0.5810910021 0.57203070 -0.0088706963
## stack 0.59439229 -0.6996862164 0.62242136 0.1205464243
## wheelbase 0.71092923 -0.8998065172 0.72976412 -0.0282440001
## chainstay_length 0.15814975 -0.2217584548 0.06649237 -0.0009586544
## bb_height 0.47593674 -0.5987405494 0.45187614 0.1300968576
## standover_height -0.11335279 0.1344210160 -0.28064197 -0.0271607065
## stem_length handlebar_width reach stack wheelbase
## rear_travel -0.52751652 0.61297175 0.508780910 0.6298637 0.8115358
## fork_travel -0.54137826 0.66689437 0.468446179 0.7035338 0.8068754
## f_piston -0.45331950 0.53800825 0.120521586 0.4062717 0.3694616
## f_rotor_dim -0.37291853 0.56677331 0.417973470 0.4637038 0.6507188
## r_piston -0.45331950 0.53800825 0.120521586 0.4062717 0.3694616
## r_rotor_dim -0.39794105 0.56297068 0.455883252 0.5943923 0.7109292
## head_angle 0.57637232 -0.67130145 -0.581091002 -0.6996862 -0.8998065
## seat_angle -0.52789789 0.52755131 0.572030701 0.6224214 0.7297641
## crank_length -0.07839163 0.09260884 -0.008870696 0.1205464 -0.0282440
## stem_length 0.98275862 -0.59053686 -0.329975492 -0.6565999 -0.5781065
## handlebar_width -0.59053686 0.98275862 0.322740937 0.6070998 0.6070346
## reach -0.32997549 0.32274094 0.982758621 0.4927676 0.6938129
## stack -0.65659992 0.60709976 0.492767604 0.9827586 0.7492409
## wheelbase -0.57810653 0.60703461 0.693812924 0.7492409 0.9827586
## chainstay_length -0.10000081 0.34555985 -0.046796458 0.1998316 0.2450669
## bb_height -0.53601700 0.69945043 0.313913983 0.5873574 0.5877167
## standover_height 0.30985818 -0.22425359 -0.112561218 -0.0762681 -0.0522580
## chainstay_length bb_height standover_height
## rear_travel 0.2176549122 0.6087615 -0.17884464
## fork_travel 0.2462722471 0.6694439 -0.08927132
## f_piston 0.2349634516 0.4559250 -0.28317481
## f_rotor_dim 0.3348790372 0.5223514 -0.04107194
## r_piston 0.2349634516 0.4559250 -0.28317481
## r_rotor_dim 0.1581497490 0.4759367 -0.11335279
## head_angle -0.2217584548 -0.5987405 0.13442102
## seat_angle 0.0664923694 0.4518761 -0.28064197
## crank_length -0.0009586544 0.1300969 -0.02716071
## stem_length -0.1000008062 -0.5360170 0.30985818
## handlebar_width 0.3455598514 0.6994504 -0.22425359
## reach -0.0467964577 0.3139140 -0.11256122
## stack 0.1998316255 0.5873574 -0.07626810
## wheelbase 0.2450668561 0.5877167 -0.05225800
## chainstay_length 0.9827586207 0.3448505 0.29091329
## bb_height 0.3448504748 0.9827586 -0.20756940
## standover_height 0.2909132943 -0.2075694 0.98275862
plot(mtb_gmm2, 'classification')Multi-class SVM
p_load(e1071,
caret)
#convert all mountain category to enduro, dc -> Xc?
remap <- function(x, num){
if (x=='am' || x=='en'){
if (num){
return(4)
}
else{
return('Enduro')
}
}
else if(x=='xc'){
if(num){
return(1)
}
else{
return('Cross Country')
}
}
else if(x=='dc'){
if(num){
return(2)
}
else{
return('Downcountry')
}
}
else if(x=='tr'){
if(num){
return(3)
}
else{
return('Trail')
}
}
}
labels <- as.factor(unlist(lapply(new_mtb_data$label, remap, F)))
n <- length(labels)
test_idx <- sort(sample(1:n, round(n/5)))
Xtest <- mtb_standard_scaled[test_idx, ]
Xtrain <- mtb_standard_scaled[-test_idx, ]
trainSVM <- function(x, y, idx){
xtest <- x[idx,]
xtrain <- x[-idx,]
ytest <- y[idx]
ytrain <- y[-idx]
clf <- svm(x=xtrain, y=ytrain)
preds <- predict(clf, xtest)
acc <- 0
cm <- table(ytest, preds)
for (i in 1:length(unique(labels))){
acc <- acc + cm[i,i]
}
return(acc/sum(cm))
}
folds <- createFolds(labels, k=10)
accs <- c()
for (fold in folds){
acc <- trainSVM(mtb_standard_scaled, labels, fold)
accs <- c(accs, acc)
}
mean(accs)## [1] 0.587619
# Roughly 60% accuracy when treating down country as separate category
# But - roughly 77% accuracy when treating down country as XC, only 65% accuracy when treating downcountry as trail, suggests that downcountry bikes are more akin to XC than they are trailfolds <- createFolds(labels, k=10)
accs_2pc <- c()
for (fold in folds){
acc <- trainSVM(pca_2_scaled, labels, fold)
accs_2pc <- c(accs, acc)
}
mean(accs_2pc)## [1] 0.5861472
dat <- data.frame(x=cbind(pca_2_scaled[,2], pca_2_scaled[,1]), y=labels)
pcsvm <- svm(y~., data=dat)
# jpeg('roughSVM.jpg')
plot(pcsvm, dat)# circle -> correctly predicted
# X -> incorrectly predicted
# black -> true XC
# Red -> true DC
# Blue -> true TR
# Green -> true EN